It is highly recommended to follow the theoretical session before starting this practical.
The aim of this session is to explore several methods/packages for computing the estimates usually required in MR studies.
In the following session, under R 4.2.1, we will use the following R packages:
Warning: loading {TwoSampleMR} conflicts with {MendelianRandomization}, masking the mr_ivw and mr_median functions.
It is also worth noting the function dat_to_MRInput in the {TwoSampleMR} package, which converts from the TwoSampleMR format to the MendelianRandomization format.
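For instance, a hedged sketch of this conversion (assuming `dat` is a harmonised TwoSampleMR data frame, as produced by harmonise_data() later in this session):

```r
library(TwoSampleMR)
library(MendelianRandomization)

# `dat` is assumed to be a harmonised TwoSampleMR data frame
mr_input_list <- dat_to_MRInput(dat)   # one MRInput object per exposure/outcome pair
MendelianRandomization::mr_ivw(mr_input_list[[1]])
```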
You have to open the Rmd file MR_Practical_Session.Rmd to access the hidden chunk named dataset; you will then get the data by copying and pasting the objects (coursedata, ratio.all and diabetes_data) into your R console.
Warning: the dataset we gave you was already harmonised - all the associations were matched up to be on the same strand, and going in the same direction. In reality, you will often need to check that these alignments are correct before using the data.
The MR-Base web app can do this for you automatically under certain set rules (e.g. remove all palindromic SNPs with a MAF >= 0.3), but we highly recommend doing this by hand if you are unfamiliar with the process and the choices being made.
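To illustrate what harmonisation has to catch, here is a minimal sketch with hypothetical data (the column names and rule are assumptions for illustration, not the MR-Base implementation) that flags strand-ambiguous palindromic SNPs:

```r
# hypothetical summary data: palindromic SNPs are A/T or C/G variants,
# whose strand cannot be resolved from the alleles alone
snps <- data.frame(
  rsid          = c("rs1", "rs2", "rs3"),
  effect_allele = c("A", "C", "A"),
  other_allele  = c("G", "G", "T"),
  eaf           = c(0.20, 0.45, 0.35)
)
pair <- paste0(pmin(snps$effect_allele, snps$other_allele),
               pmax(snps$effect_allele, snps$other_allele))
snps$palindromic <- pair %in% c("AT", "CG")
# drop palindromic SNPs whose minor allele frequency is too close
# to 0.5 for the allele frequencies to resolve the strand
keep <- !(snps$palindromic & pmin(snps$eaf, 1 - snps$eaf) >= 0.3)
snps[keep, ]
```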
## load coursedata
## load ratio.all
## load diabetes_data
To apply the ratio method, we will use the dataset coursedata (individual-level data), which provides phenotypes (x and y) and genotypes (the g columns).
head(coursedata)
## ID g1 g2 g3 g4 x y y.bin
## 1 1 1 0 0 1 0.51141400 -1.2970720 1
## 2 2 0 0 0 0 -1.05450500 0.9117613 0
## 3 3 1 1 0 1 -0.56296820 1.2791060 0
## 4 4 0 0 0 0 0.86149410 -1.4378610 1
## 5 5 0 0 0 1 0.01561589 2.7192830 0
## 6 6 2 0 0 0 0.68330030 1.6415770 0
\[\hat{\theta}_j = \frac{\hat{\beta}_{Y_j}}{\hat{\beta}_{X_j}}\]
where j = 1 in this example.
# Genetic association with the outcome
by1 <- lm(formula = y ~ g1, data = coursedata)$coef[2]
# Genetic association with the exposure
bx1 <- lm(formula = x ~ g1, data = coursedata)$coef[2]
beta.ratio1 <- by1 / bx1
cat("Ratio estimate for g1 =", beta.ratio1)
## Ratio estimate for g1 = -0.06302634
\[se_{first}(\hat{\theta}_j) = \frac{se(\hat{\beta}_{Y_j})}{|\hat{\beta}_{X_j}|}\]
# Standard error of the G-Y association
byse1 <- summary(lm(formula = y ~ g1, data = coursedata))$coef[2, 2]
se.ratio1first <- byse1 / sqrt(bx1^2)
## sqrt(bx1^2) => because se is always positive.
cat("Standard error (first order) of the ratio estimate g1 =", se.ratio1first)
## Standard error (first order) of the ratio estimate g1 = 0.6451987
\[se_{second}(\hat{\theta}_j) = \sqrt{\frac{se(\hat{\beta}_{Y_j})^2}{\hat{\beta}_{X_j}^2} + \frac{\hat{\beta}_{Y_j}^2\, se(\hat{\beta}_{X_j})^2}{\hat{\beta}_{X_j}^4}}\]
# Standard error of the G-X association
bxse1 <- summary(lm(formula = x ~ g1, data = coursedata))$coef[2, 2]
se.ratio1second <- sqrt(byse1^2 / bx1^2 + by1^2 * bxse1^2 / bx1^4)
cat("Standard error (second order) of the ratio estimate g1 =", se.ratio1second)
## Standard error (second order) of the ratio estimate g1 = 0.6459625
The F-statistic from the regression of the risk factor on the genetic
variant(s) is used as a measure of ‘weak instrument bias’, with smaller
values suggesting that the estimate may suffer from weak instrument
bias.
For instance, some studies recommend excluding genetic variants if they
have an F-statistic less than 10.
fstat1 <- summary(lm(formula = x ~ g1, data = coursedata))$f[1]
cat("F-stat =", fstat1)
## F-stat = 4.027979
In a case-control setting, it is usual to regress the risk factor on the genetic variant in the controls only, because the case-control sample is generally unrepresentative of the population, and measurements of the risk factor may be made post-disease; restricting to controls avoids possible reverse causation.
# logistic regression for gene-outcome association
by1.bin <- glm(formula = y.bin ~ g1, data = coursedata, family = binomial)$coef[2]
byse1.bin <- summary(glm(formula = y.bin ~ g1, data = coursedata, family = binomial))$coef[2, 2]
# linear regression in the controls only
# bx1.bin <- lm(coursedata$x[coursedata$y.bin==0]~coursedata$g1[coursedata$y.bin ==0])$coef[2]
bx1.bin <- lm(x ~ g1, data = coursedata[coursedata$y.bin == 0, ])$coef[2]
beta.ratio1.bin <- by1.bin / bx1.bin
# beta.ratio1.bin #ratio estimate for g1
cat("Ratio estimate for g1 with binary outcome =", beta.ratio1.bin, "\n")
## Ratio estimate for g1 with binary outcome = 0.2599639
se.ratio1.bin <- byse1.bin / bx1.bin
# se.ratio1.bin #standard error of the ratio estimate for g1
cat("Standard error of the ratio estimate for g1 with binary outcome =", se.ratio1.bin)
## Standard error of the ratio estimate for g1 with binary outcome = 0.5921551
When multiple genetic variants G are used as instrumental variables, TSLS (or 2SLS) is performed by:
\[X \sim G_1 + G_2 + \dots + G_n\]
i.e. regressing the risk factor on all the genetic variants in the same model, and storing the fitted values of the risk factor. And
\[Y \sim \hat{X}\]
i.e. a regression is then performed with the outcome on the fitted values of the risk factor.
By hand, we can make the computation as follows (but it is not recommended!):
by.hand.fitted.values <- lm(x ~ g1 + g2 + g3 + g4, data = coursedata)$fitted
by.hand <- lm(coursedata$y ~ by.hand.fitted.values)
cat("TSLS estimate =", summary(by.hand)$coef[2], "\n")
## TSLS estimate = 0.570785
cat("TSLS standard error =", summary(by.hand)$coef[2, 2])
## TSLS standard error = 0.2016973
Performing two-stage least squares by hand is generally discouraged,
as the standard error from the second-stage regression does not
take into account uncertainty in the first-stage regression. The R
package {ivpack} (which depends on {AER}) performs the two-stage least
squares method using the ivreg function:
# library(ivpack) ## Depends AER for ivreg
ivmodel.all <- ivreg(y ~ x | g1 + g2 + g3 + g4, x = TRUE, data = coursedata)
# 2SLS estimates
cat("TSLS estimate =", summary(ivmodel.all)$coef[2], "\n")
## TSLS estimate = 0.570785
cat("TSLS standard error =", summary(ivmodel.all)$coef[2, 2])
## TSLS standard error = 0.2291837
cat("F-stat =", summary(lm(x ~ g1 + g2 + g3 + g4, data = coursedata))$f[1])
## F-stat = 10.5824
With a binary outcome, the first stage is conducted on the controls only, and the second stage is a logistic regression.
# values for g1 in the controls only
g1.con <- coursedata$g1[coursedata$y.bin == 0]
# values for the risk factor in the controls only
x.con <- coursedata$x[coursedata$y.bin == 0]
# Generate predicted values for all participants based on the linear regression in the controls only :
predict.con.g1 <- predict(lm(x.con ~ g1.con), newdata = list(g1.con = coursedata$g1))
# Fit a logistic regression model on all the participants
tsls1.con <- glm(coursedata$y.bin ~ predict.con.g1, family = binomial)
cat("TSLS estimate for a binary outcome =", summary(tsls1.con)$coef[2], "\n")
## TSLS estimate for a binary outcome = 0.2599639
cat("TSLS standard error for a binary outcome =", summary(tsls1.con)$coef[2, 2])
## TSLS standard error for a binary outcome = 0.5921551
Observation: with a single genetic variant as the instrumental variable, TSLS and the ratio method give identical estimates:
beta.ratio1.bin = summary(tsls1.con)$coef[2]
With multiple genetic variants as instrumental variables, the TSLS estimate is a weighted average of the ratio estimates.
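As a quick numerical check in the continuous-outcome case (a sketch using the summary statistics stored in ratio.all, loaded at the start of the session):

```r
theta <- as.numeric(ratio.all["beta.ratio", ])      # per-variant ratio estimates
se1   <- as.numeric(ratio.all["se.ratio.first", ])  # first-order standard errors
w     <- se1^-2                                     # inverse-variance weights
sum(w * theta) / sum(w)   # ~ 0.57, close to the TSLS estimate above
```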
# g1.con <- coursedata$g1[coursedata$y.bin == 0] # values for g2 in the controls only
g2.con <- coursedata$g2[coursedata$y.bin == 0] # values for g2 in the controls only
g3.con <- coursedata$g3[coursedata$y.bin == 0] # values for g3 in the controls only
g4.con <- coursedata$g4[coursedata$y.bin == 0] # values for g4 in the controls only
predict.con <- predict(
lm(x.con ~ g1.con + g2.con + g3.con + g4.con), # Predicted values
newdata = c(
list(g1.con = coursedata$g1),
list(g2.con = coursedata$g2),
list(g3.con = coursedata$g3),
list(g4.con = coursedata$g4)
)
)
# Logistic regression
tsls1.con.all <- glm(coursedata$y.bin ~ predict.con, family = binomial)
cat("TSLS estimate for a binary outcome =", summary(tsls1.con.all)$coef[2], "\n")
## TSLS estimate for a binary outcome = 0.9410501
cat("TSLS standard error for a binary outcome =", summary(tsls1.con.all)$coef[2, 2])
## TSLS standard error for a binary outcome = 0.3753075
Does the ivreg function work for a binary outcome? For example:
ivmodel.all.bin = ivreg(y.bin ~ x | g1 + g2 + g3 + g4, x = TRUE, data = coursedata)
Answer: the function works with a binary trait, in that it will run and give you an answer. But it does not treat the trait as special because it is binary: the estimate simply represents the absolute change in the outcome, as if it were a continuous variable. Hence, if you want an odds ratio or a relative risk estimate, we suggest using the two-stage method. Provided that the genetic instruments are strong, the overprecision due to not accounting for uncertainty in the first-stage regression is typically small, even negligible.
We can use summarized data (the genetic associations with the risk factor and with the outcome, with their standard errors) to estimate the causal effect of the risk factor on the outcome via the inverse-variance weighted (IVW) method.
To do so, we will use the dataset ratio.all,
ratio.all
## g1 g2 g3 g4
## by -0.008550605 0.2565551 0.2778433 0.17188563
## byse 0.087532274 0.1325076 0.1316129 0.12653807
## bx 0.135667161 0.4938326 0.3475786 0.06788043
## bxse 0.067597577 0.1015321 0.1014761 0.09798360
## beta.ratio -0.063026340 0.5195184 0.7993682 2.53218242
## se.ratio.first 0.645198685 0.2683249 0.3786566 1.86413185
## se.ratio.second 0.645962479 0.2888032 0.4447983 4.10305056
## fstat 4.027979155 23.6566409 11.7321775 0.47993491
## MAF 0.301500000 0.1010000 0.1030000 0.11150000
bx <- ratio.all["bx", ]
by <- ratio.all["by", ]
bxse <- ratio.all["bxse", ]
byse <- ratio.all["byse", ]
and apply the following formulas:
\[\hat{\theta}_{IVW} = \frac{\sum_{j=1}^{n} \hat{\theta}_j /var(\hat{\theta}_j)}{\sum_{j=1}^{n} 1 / var(\hat{\theta}_j)}\]
\[var(\hat{\theta}_j) = \frac{var(\hat{\beta}_{Y_j})}{\hat{\beta}^2_{X_j}} = \frac{\sigma^2_{Y_j}}{\hat{\beta}^2_{X_j}}\]
beta.ivw <- sum(bx * by * byse^-2) / sum(bx^2 * byse^-2)
cat("IVW estimate = ", beta.ivw, "\n")
## IVW estimate = 0.567561
se.ivw <- 1 / sqrt(sum(bx^2 * byse^-2))
cat("Standard error of the IVW estimate = ", se.ivw)
## Standard error of the IVW estimate = 0.2060492
In this section, we present three different ways of motivating the inverse-variance weighted formula defined above.
First, the IVW method can be motivated as a fixed-effect meta-analysis of the ratio estimates from the individual variants, using the first-order standard errors. Calculate the ratio estimates and first-order standard errors for the four genetic variants:
beta.ratio.all <- t(by / bx)
se.ratio.all <- t(byse / bx)
You can now perform an inverse-variance weighted
meta-analysis using the metagen command from the
{meta} package:
# library(meta)
cat("Perform a meta analyses with Beta and Se : \n")
## Perform a meta analyses with Beta and Se :
metagen(beta.ratio.all, se.ratio.all)
## Number of studies combined: k = 4
##
## 95%-CI z p-value
## Common effect model 0.5676 [0.1637; 0.9714] 2.75 0.0059
## Random effects model 0.5676 [0.1637; 0.9714] 2.75 0.0059
##
## Quantifying heterogeneity:
## tau^2 < 0.0001 [0.0000; 14.8943]; tau = 0.0008 [0.0000; 3.8593]
## I^2 = 0.0% [0.0%; 84.7%]; H = 1.00 [1.00; 2.56]
##
## Test of heterogeneity:
## Q d.f. p-value
## 2.47 3 0.4802
##
## Details on meta-analytical method:
## - Inverse variance method
## - Restricted maximum-likelihood estimator for tau^2
## - Q-Profile method for confidence interval of tau^2 and tau
cat(
"Get TE (Estimate of treatment effect, e.g., log hazard ratio or risk difference) = ",
metagen(beta.ratio.all, se.ratio.all)$TE.fixed, " \n"
)
## Get TE (Estimate of treatment effect, e.g., log hazard ratio or risk difference) = 0.567561
cat(
"Get seTE.fixed (Standard error of treatment estimate) = ",
metagen(beta.ratio.all, se.ratio.all)$seTE.fixed, " \n"
)
## Get seTE.fixed (Standard error of treatment estimate) = 0.2060492
The fixed-effect estimate and standard error from the metagen
command are identical to the IVW estimate and standard error.
A fixed-effect analysis may not be appropriate if there is
heterogeneity between the causal estimates.
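The heterogeneity reported by metagen can be reproduced by hand. This sketch (not in the original code) computes Cochran's Q from the ratio estimates and first-order weights defined above:

```r
# Cochran's Q: weighted squared deviations of the variant-specific
# estimates around the IVW estimate, compared with a chi-squared
# distribution on (number of variants - 1) degrees of freedom
theta <- as.numeric(beta.ratio.all)
w     <- as.numeric(se.ratio.all)^-2
theta.ivw <- sum(w * theta) / sum(w)
Q  <- sum(w * (theta - theta.ivw)^2)
df <- length(theta) - 1
c(Q = Q, p.value = pchisq(Q, df, lower.tail = FALSE))
## Q ~ 2.47 on 3 df (p ~ 0.48), in line with the metagen output above
```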
Secondly, the IVW method can also be motivated as a ratio estimate using a weighted allele score as an instrumental variable. We use the estimated genetic associations with the risk factor to create an allele score:
score <- coursedata$g1 * as.numeric(bx[1]) + coursedata$g2 * as.numeric(bx[2]) + coursedata$g3 * as.numeric(bx[3]) + coursedata$g4 * as.numeric(bx[4])
To calculate the ratio estimate and its standard error (first-order) using this score as an instrumental variable:
ivmodel.score <- ivreg(coursedata$y ~ coursedata$x | score, x = TRUE)
cat("Get the Instrumental-Variable Regression estimates :\n")
## Get the Instrumental-Variable Regression estimates :
summary(ivmodel.score)
##
## Call:
## ivreg(formula = coursedata$y ~ coursedata$x | score, x = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2324383 -1.3661564 -0.0009812 1.4010821 7.1175494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06953 0.11435 -0.608 0.5433
## coursedata$x 0.56732 0.22897 2.478 0.0134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 998 degrees of freedom
## Multiple R-Squared: -0.278, Adjusted R-squared: -0.2792
## Wald test: 6.139 on 1 and 998 DF, p-value: 0.01339
The results of the above model are very similar to those from using
the genetic variants as the instrument.
The difference between the allele score and the fitted values is
approximately the intercept term from the regression of the risk factor
on all the genetic variants.
Thirdly, it is motivated by weighted linear regression of the genetic association estimates, with the intercept set to zero. Use the following code to fit the weighted linear regression model and obtain the causal estimate:
# transpose the summary statistics to column vectors
BY <- t(by)
BX <- t(bx)
BYSE <- t(byse)
BXSE <- t(bxse)
regression <- lm(BY ~ BX - 1, weights = BYSE^-2)
cat("Get the weighted linear regression estimates :\n")
## Get the weighted linear regression estimates :
summary(regression)
##
## Call:
## lm(formula = BY ~ BX - 1, weights = BYSE^-2)
##
## Weighted Residuals:
## g1 g2 g3 g4
## -0.9774 -0.1790 0.6122 1.0539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## BX 0.5676 0.1871 3.034 0.0561 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9079 on 3 degrees of freedom
## Multiple R-squared: 0.7542, Adjusted R-squared: 0.6723
## F-statistic: 9.205 on 1 and 3 DF, p-value: 0.05613
cat("Get the estimates = ", summary(regression)$coef[1, 1], "\n")
## Get the estimates = 0.567561
cat(
"And divide the standard error by the sigma = ",
summary(regression)$coef[1, 2] / summary(regression)$sigma
)
## And divide the standard error by the sigma = 0.2060492
Why do we divide the standard error by the sigma quantity in the final line of code?
Answer: We want to avoid underdispersion (the estimate being more precise than from a fixed-effect meta-analysis). If sigma, the standard deviation of the variant-specific estimates, is less than 1, then the dispersion of the estimates is lower than one would expect due to chance alone (based on the precision of the estimates). If sigma had been greater than 1, then the variant-specific estimates are more heterogeneous than one would expect due to chance alone. We want to allow overdispersion to make the confidence intervals wider, but not allow underdispersion to make them narrower. So the standard error of the causal estimate is divided by sigma (the estimate of the residual standard error) to force the residual standard error to be at least one.
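The truncation described above can be made explicit. A small sketch (assuming the regression object fitted earlier):

```r
raw.se    <- summary(regression)$coef[1, 2]   # includes the factor sigma
sigma.hat <- summary(regression)$sigma
# divide out sigma, then multiply back by max(sigma, 1): overdispersion
# widens the interval, underdispersion is not allowed to narrow it
corrected.se <- raw.se / sigma.hat * max(sigma.hat, 1)
corrected.se  # here sigma < 1, so this equals the fixed-effect IVW standard error
```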
N.B.: running a weighted regression
lm(betaY ~ betaX - 1, weights = 1/betaYse^2) is
equivalent to using mr_input and mr_ivw from the
R package {MendelianRandomization}:
exactly the same effect, standard error and p-value.
But lm also reports an R-squared: this tells us how much of the variation in the
genetic associations with the outcome can be explained by the genetic
associations with the risk factor.
The {MendelianRandomization} package was specifically
written to implement Mendelian randomization analyses using summarized
data.
Before using any of the estimation functions, we have to define a
MRInput object to ensure that the data are correctly
formatted.
bx.all <- ratio.all["bx", ]
by.all <- ratio.all["by", ]
bxse.all <- ratio.all["bxse", ]
byse.all <- ratio.all["byse", ]
MRObject <- mr_input(
bx = as.numeric(bx.all),
bxse = as.numeric(bxse.all),
by = as.numeric(by.all),
byse = as.numeric(byse.all)
)
cat("Run an inverse-variance weighted (IVW) model \n")
## Run an inverse-variance weighted (IVW) model
mr_ivw(MRObject) ## inverse-variance weighted (IVW) estimate
##
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
##
## Number of Variants : 4
##
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## IVW 0.568 0.206 0.164, 0.971 0.006
## ------------------------------------------------------------------
## Residual standard error = 0.908
## Residual standard error is set to 1 in calculation of confidence interval when its estimate is less than 1.
## Heterogeneity test statistic (Cochran's Q) = 2.4728 on 3 degrees of freedom, (p-value = 0.4802). I^2 = 0.0%.
## Or in a single line of code as
# mr_ivw(mr_input(bx = bx.all, bxse = bxse.all, by = by.all, byse = byse.all))
The estimates above are all the same.
To understand the difference obtained by setting a fixed or random
effect in mr_ivw, we will use the dataset
diabetes_data, which provides betas and standard errors for several SNPs.
head(diabetes_data)
## X SNP beta.exposure beta.outcome se.exposure se.outcome
## 1 1 rs10184004 -0.00340017 -0.0121 0.000530106 0.0165
## 2 2 rs10487796 -0.00364702 -0.0237 0.000523323 0.0165
## 3 3 rs10748582 -0.00471516 -0.0029 0.000536941 0.0164
## 4 4 rs10830963 0.00441241 0.0105 0.000582103 0.0201
## 5 5 rs10965247 -0.00718076 -0.0246 0.000682275 0.0223
## 6 6 rs11671304 0.00319588 -0.0094 0.000558102 0.0184
MRObject2 <- mr_input(
bx = diabetes_data$beta.exposure, bxse = diabetes_data$se.exposure,
by = diabetes_data$beta.outcome, byse = diabetes_data$se.outcome
)
The default option is a fixed-effect analysis with 3 or fewer variants, and a random-effects analysis with 4 or more. A fixed-effect analysis can be forced using:
mr_ivw(MRObject2, model = "fixed")
##
## Inverse-variance weighted method
## (variants uncorrelated, fixed-effect model)
##
## Number of Variants : 38
##
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## IVW 0.660 0.591 -0.498, 1.818 0.264
## ------------------------------------------------------------------
## Residual standard error = 1.168
## Residual standard error is set to 1 in calculation of confidence interval by fixed-effect assumption.
## Heterogeneity test statistic (Cochran's Q) = 50.4475 on 37 degrees of freedom, (p-value = 0.0692). I^2 = 26.7%.
Note: a fixed-effect model assumes no pleiotropic effects.
A random effect analysis can be forced using:
mr_ivw(MRObject2, model = "random")
##
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
##
## Number of Variants : 38
##
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## IVW 0.660 0.690 -0.692, 2.012 0.339
## ------------------------------------------------------------------
## Residual standard error = 1.168
## Heterogeneity test statistic (Cochran's Q) = 50.4475 on 37 degrees of freedom, (p-value = 0.0692). I^2 = 26.7%.
Note: a random-effects model allows for pleiotropic effects, provided that on average they behave as background noise (balanced pleiotropy).
The confidence intervals are much larger with random effects.
Note : In a meta-analysis, a fixed-effect analysis means that all the studies estimate the same quantity. A random-effect analysis means that the studies can estimate different quantities, and the result is the average estimate. Formally, we usually assume that the study-specific estimates are normally distributed about a mean value with a given variance.
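In our notation, this can be sketched formally as follows (with \(\tau^2\) denoting the between-variant heterogeneity variance; this formalisation is an addition, not from the source):
\[\text{fixed-effect: } \hat{\theta}_j \sim N(\theta, se(\hat{\theta}_j)^2) \qquad \text{random-effects: } \hat{\theta}_j \sim N(\theta, \tau^2 + se(\hat{\theta}_j)^2)\]
The fixed-effect model is the special case \(\tau^2 = 0\).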
Warning: loading both the
{MendelianRandomization} and {TwoSampleMR}
packages can cause errors, as they both include functions called
mr_ivw. We recommend you stick to a single package
within any given analysis.
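If you do need both packages attached at once, a sketch of the namespace-qualified calls (assuming the MRObject defined earlier):

```r
library(MendelianRandomization)
library(TwoSampleMR)   # masks mr_ivw and mr_median

# call the summary-data IVW unambiguously through its namespace
MendelianRandomization::mr_ivw(MRObject)
# the TwoSampleMR version remains reachable as TwoSampleMR::mr_ivw
```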
Note: recent changes to the googleAuthR package have broken
the way the {TwoSampleMR} package authenticates; in order
to run this exercise as intended, we need to install an earlier
version of that package. If you are getting errors, use this install
command:
install_github('MarkEdmondson1234/googleAuthR@v0.8.1')
While MR-Base can be used on your own data, one of its main strengths is the speed with which data for exposures and outcomes can be found and combined. Here we download two chosen datasets from the MR-Base website.
library(TwoSampleMR)
# The following objects are masked from 'package:MendelianRandomization':
# mr_ivw, mr_median
# this downloads instruments suitable for Diabetes
exposure_dat <- extract_instruments(c("ukb-b-12948"))
# this finds the data for same snps, but with outcome as Ischemic stroke
outcome_dat <- extract_outcome_data(
exposure_dat$SNP, c("ieu-a-1108"),
proxies = 1, rsq = 0.8, align_alleles = 1,
palindromes = 1, maf_threshold = 0.3
)
# this combines the two data sets,
# automatically aligning snps
# BUT removing any snp that cannot be placed into matching alignment (!)
dat <- harmonise_data(exposure_dat, outcome_dat, action = 2)
# this fits the fixed and random effect ivw models
mr_results <- mr(dat, method_list = c("mr_ivw_mre", "mr_ivw_fe"))
# Note if you need to change back to the MendelianRandomization package
detach("package:TwoSampleMR", unload = TRUE)
For this section of the practical, we use the example data provided
in the {MendelianRandomization} package on the associations
(beta-coefficients and standard errors) of 28 genetic variants with
three risk factors: LDL-cholesterol (ldlc and
ldlcse), HDL-cholesterol (hdlc and
hdlcse), and triglycerides (trig and
trigse); and with one outcome: coronary heart disease
(chdlodds and chdloddsse). The associations
with coronary heart disease (CHD) risk are log odds ratios
(beta-coefficients from a logistic regression analysis).
The summary data for the risk factors and outcome were taken from the
paper by Waterworth et al (2011) Genetic variants influencing
circulating lipid levels and risk of coronary artery disease, doi:
10.1161/atvbaha.109.201020.
The 28 genetic variants in the example dataset reached genome-wide level
of significance (p-value < 5 x 10^-8) with at least one of the three
risk factors (LDL-cholesterol, HDL-cholesterol, and triglycerides) in
the paper by Waterworth et al (2011).
The MR-Egger method can be implemented with as few as 3 variants; however, it will generally give imprecise estimates unless there is a good number of variants. Hence we can use the dataset described above.
library(MendelianRandomization)
# creates mr_input objects
MR_LDLObject <- mr_input(bx = ldlc, bxse = ldlcse, by = chdlodds, byse = chdloddsse)
MR_HDLObject <- mr_input(bx = hdlc, bxse = hdlcse, by = chdlodds, byse = chdloddsse)
# fit some egger and median models
cat("Run the MR Egger model : \n")
## Run the MR Egger model :
mr_egger(MR_LDLObject)
##
## MR-Egger method
## (variants uncorrelated, random-effect model)
##
## Number of Variants = 28
##
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## MR-Egger 3.253 0.770 1.743, 4.762 0.000
## (intercept) -0.011 0.015 -0.041, 0.018 0.451
## ------------------------------------------------------------------
## Residual Standard Error : 1.935
## Heterogeneity test statistic = 97.3975 on 26 degrees of freedom, (p-value = 0.0000)
## I^2_GX statistic: 91.9%
cat("Run the MR median-based method weighted : \n")
## Run the MR median-based method weighted :
mr_median(MR_LDLObject, weighting = "weighted")
##
## Weighted median method
##
## Number of Variants : 28
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## Weighted median method 2.683 0.419 1.862, 3.504 0.000
## ------------------------------------------------------------------
cat("Run the MR median-based method simple : \n")
## Run the MR median-based method simple :
mr_median(MR_LDLObject, weighting = "simple")
##
## Simple median method
##
## Number of Variants : 28
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## Simple median method 1.755 0.740 0.305, 3.205 0.018
## ------------------------------------------------------------------
# graph mr models
cat("Run the all MR method : \n")
## Run the all MR method :
mr_allmethods(MR_LDLObject)
## Method Estimate Std Error 95% CI P-value
## Simple median 1.755 0.740 0.305 3.205 0.018
## Weighted median 2.683 0.419 1.862 3.504 0.000
## Penalized weighted median 2.681 0.420 1.857 3.505 0.000
##
## IVW 2.834 0.530 1.796 3.873 0.000
## Penalized IVW 2.561 0.413 1.752 3.370 0.000
## Robust IVW 2.797 0.307 2.195 3.399 0.000
## Penalized robust IVW 2.576 0.251 2.083 3.069 0.000
##
## MR-Egger 3.253 0.770 1.743 4.762 0.000
## (intercept) -0.011 0.015 -0.041 0.018 0.451
## Penalized MR-Egger 3.421 0.531 2.380 4.461 0.000
## (intercept) -0.022 0.011 -0.043 0.000 0.051
## Robust MR-Egger 3.256 0.624 2.033 4.479 0.000
## (intercept) -0.015 0.021 -0.055 0.026 0.474
## Penalized robust MR-Egger 3.502 0.478 2.566 4.438 0.000
## (intercept) -0.026 0.014 -0.054 0.003 0.075
mr_allmethods(MR_LDLObject, method = "main")
## Method Estimate Std Error 95% CI P-value
## Simple median 1.755 0.740 0.305 3.205 0.018
## Weighted median 2.683 0.419 1.862 3.504 0.000
## IVW 2.834 0.530 1.796 3.873 0.000
## MR-Egger 3.253 0.770 1.743 4.762 0.000
## (intercept) -0.011 0.015 -0.041 0.018 0.451
Scatter plot of the genetic associations with the outcome (vertical axis, \(\hat{\beta}_{Y}\)) against the genetic associations with the exposure (horizontal axis, \(\hat{\beta}_{X}\)).
The error bars through each point represent the 95% confidence intervals for the genetic associations with the exposure and with the outcome.
The ratio estimate for each variant is the gradient of the line from the origin to its data point; the combined causal estimate is the slope of the regression line through the n variant-specific estimates \(\hat{\theta}_j\).
For instance, using the estimates from the weighted linear regression:
plot(
x = BX, y = BY,
xlim = c(min(BX - 2 * BXSE, 0), max(BX + 2 * BXSE, 0)),
ylim = c(min(BY - 2 * BYSE, 0), max(BY + 2 * BYSE, 0))
)
for (j in 1:length(BX)) {
lines(x = c(BX[j], BX[j]), y = c(BY[j] - 1.96 * BYSE[j], BY[j] + 1.96 * BYSE[j]))
lines(x = c(BX[j] - 1.96 * BXSE[j], BX[j] + 1.96 * BXSE[j]), y = c(BY[j], BY[j]))
}
abline(h = 0, lwd = 1)
abline(v = 0, lwd = 1)
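As an optional addition (not in the original code), the IVW estimate can be overlaid as the slope of a line through the origin, assuming beta.ivw from the IVW section is still in the workspace:

```r
# overlay the IVW slope: a line through the origin with gradient beta.ivw
abline(a = 0, b = beta.ivw, col = "blue", lty = 2)
```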
The function mr_plot for the graphical display of summarized data has two main modalities.
First, with an MRInput object from the mr_input function. For instance, let's use MR_LDLObject or MR_HDLObject:
## plotly
mr_plot(MR_LDLObject)
mr_plot(MR_LDLObject, orientate = TRUE, line = "egger")
mr_plot(MR_LDLObject, interactive = FALSE)
mr_plot(MR_LDLObject, interactive = FALSE, labels = TRUE)
mr_plot(MR_LDLObject, interactive = TRUE, labels = TRUE)
Second, with an MRAll object from the mr_allmethods function:
mr_plot(mr_allmethods(MR_LDLObject))
mr_plot(mr_allmethods(MR_HDLObject))
mr_plot(mr_allmethods(MR_HDLObject, method = "main"))
mr_plot(mr_allmethods(MR_HDLObject, method = "egger"))
As previously, please read through the help documentation (for
example, type ?mr_plot), and explore the various
functions.
See MR-Base website examples.
{TwoSampleMR} can also produce various plots for the
data :
library(TwoSampleMR)
# scatter plot
mr_results <- mr(dat, method_list = c("mr_ivw_mre", "mr_ivw_fe"))
p1 <- mr_scatter_plot(mr_results, dat) # this creates a scattergraph of the results
p1[[1]]
# single snp plot
res_single <- mr_singlesnp(dat)
p2 <- mr_forest_plot(res_single)
p2[[1]]
# leave one out plot
res_loo <- mr_leaveoneout(dat)
p3 <- mr_leaveoneout_plot(res_loo)
p3[[1]]
# funnel plot: can help diagnose bias in a meta-analysis by
# visually detecting publication bias (selective outcome reporting)
res_single <- mr_singlesnp(dat)
p4 <- mr_funnel_plot(res_single)
p4[[1]]
# Note if you need to change back to the MendelianRandomization package
detach("package:TwoSampleMR", unload = TRUE)
The final paragraph of the vignette on {MendelianRandomization} says:
It is important to remember that the difficult part of a Mendelian randomization analysis is not the computational method, but deciding what should go into the analysis: which risk factor, which outcome, and (most importantly) which genetic variants. Hopefully, the availability of these tools will enable less attention to be paid to the mechanics of the analysis, and more attention to these choices. The note of caution is that tools that make Mendelian randomization simple to perform run the risk of encouraging large numbers of speculative analyses to be performed in an unprincipled way. It is important that Mendelian randomization is not performed in a way that avoids critical thought.
(* This is an important step, but one that you may want to omit here due to time)
I chose body mass index and cigarette smoking: are increases in BMI associated with increases in cigarette smoking in a causal sense?
N.B.: the package {TwoSampleMR} provides the function available_outcomes() to list the studies available in MR-Base.
I found a large GWAS investigation of BMI with 32 significant genetic variants: "Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index" by Speliotes et al. I got the list of SNPs into R by pasting Table 1 and using a grep command:
bmi_snps <- c("rs1558902", "rs2867125", "rs571312", "rs10938397", "rs10767664", "rs2815752", "rs7359397", "rs9816226", "rs3817334", "rs29941", "rs543874", "rs987237", "rs7138803", "rs10150332", "rs713586", "rs12444979", "rs2241423", "rs2287019", "rs1514175", "rs13107325", "rs2112347", "rs10968576", "rs3810291", "rs887912", "rs13078807", "rs11847697", "rs2890652", "rs1555543", "rs4771122", "rs4836133", "rs4929949", "rs206936")
I used PhenoScanner to obtain genetic associations with BMI from GIANT and with cigarettes smoked per day from TAG (both are large GWAS consortia)
# phenoscanner() ## proxies = "None"
mr_obj <- pheno_input(
snps = bmi_snps,
exposure = "Body mass index",
pmidE = "25673413",
ancestryE = "European",
outcome = "Cigarettes per day",
pmidO = "20418890",
ancestryO = "European"
)
## PhenoScanner V2
## Cardiovascular Epidemiology Unit
## University of Cambridge
## Email: phenoscanner@gmail.com
##
## Information: Each user can query a maximum of 10,000 SNPs (in batches of 100), 1,000 genes (in batches of 10) or 1,000 regions (in batches of 10) per hour. For large batch queries, please ask to download the data from www.phenoscanner.medschl.cam.ac.uk/data.
## Terms of use: Please refer to the terms of use when using PhenoScanner V2 (www.phenoscanner.medschl.cam.ac.uk/about). If you use the results from PhenoScanner in a publication or presentation, please cite all of the relevant references of the data used and the PhenoScanner publications: www.phenoscanner.medschl.cam.ac.uk/about/#citation.
##
## [1] "1 -- chunk of 10 SNPs queried"
## [1] "2 -- chunk of 10 SNPs queried"
## [1] "3 -- chunk of 10 SNPs queried"
## [1] "4 -- chunk of 10 SNPs queried"
# mr_obj ## show the whole table
slotNames(mr_obj)
## [1] "betaX" "betaY" "betaXse" "betaYse"
## [5] "exposure" "outcome" "snps" "effect_allele"
## [9] "other_allele" "eaf" "correlation"
# mr_obj@betaX
# mr_obj@betaY # ...
# Other example : with Phenoscanner website
# risk factor : cholesterol (total cholesterol, ldl or hdl)
# outcome : gallstones - gallbladder stones (calculs biliaires)
# library(MendelianRandomization)
library(tidyverse)
data_cholesterol <- readr::read_tsv(file = "./Practicals sessions/Practical5 Hackathon/cholesterol_PhenoScanner_GWAS.tsv")
data_gall <- readr::read_tsv(file = "./Practicals sessions/Practical5 Hackathon/gallstones_PhenoScanner_GWAS.tsv")
my_snp <- intersect(data_cholesterol$rsid, data_gall$rsid)
cat("my IV has", length(my_snp), " variants")
# head(data_cholesterol[data_cholesterol$rsid %in% my_snp, ]) %>%
# as.data.frame()
# data_gall[data_gall$rsid %in% my_snp, ] %>%
# as.data.frame()
my_obj <- pheno_input(
snps = my_snp,
exposure = "cholesterol",
outcome = "gallstones",
pmidE = "24097068",
ancestryE = "European",
pmidO = "17632509",
ancestryO = "Mixed"
)
What if for pmidO we had selected a study with
ancestryO = "Mixed" (or anything other than European)? Would it
necessarily be a mistake?
Or would it be acceptable if we had checked that the estimates (Bx and By)
provided in studies “25673413” and “20418890” are properly adjusted for
ancestry (e.g. with 5 principal components)? In that case, is it OK?
SB Answer: Studies with a “Mixed” population are often more than 90% European, so in practice there is usually no big “mistake”, even if strictly speaking it is one. It is not a mistake per se, but it is best to have the two samples be as similar as possible; and even under the label “European” there can be a lot of heterogeneity.
I did not perform this step in this case, but one would want to in order to make reliable inferences.
I calculated MR estimates using the {MendelianRandomization} package
mr_ivw(mr_obj)
##
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
##
## Number of Variants : 31
##
## ------------------------------------------------------------------
## Method Estimate Std Error 95% CI p-value
## IVW 1.465 0.490 0.504, 2.426 0.003
## ------------------------------------------------------------------
## Residual standard error = 0.955
## Residual standard error is set to 1 in calculation of confidence interval when its estimate is less than 1.
## Heterogeneity test statistic (Cochran's Q) = 27.3475 on 30 degrees of freedom, (p-value = 0.6050). I^2 = 0.0%.
mr_plot(mr_obj, orientate = TRUE)
mr_allmethods(mr_obj, method = "main")
## Method Estimate Std Error 95% CI P-value
## Simple median 1.534 0.729 0.105 2.963 0.035
## Weighted median 1.526 0.704 0.147 2.906 0.030
## IVW 1.465 0.490 0.504 2.426 0.003
## MR-Egger 2.808 1.308 0.243 5.372 0.032
## (intercept) -0.051 0.046 -0.142 0.039 0.268
I didn’t perform additional analyses in this case, but several interesting supplementary analyses could be performed
Assumption of a linear effect of X on Y: what if this is violated? Assessing non-linearity requires access to individual-level data. A nice study looking at non-linear effects: https://www.bmj.com/content/364/bmj.l1042
One cohort but two data sources: this is a one-sample analysis (not two-sample).
Explanation of the multivariable IVW method: the estimate is equivalent to
summary(lm(chdlodds ~ x1 + x2 + x3 - 1, weights = chdloddsse^-2))
but lm uses a t-distribution, whereas the multivariable IVW function (mr_mvivw) assumes
normality, so a normal distribution is used for inference.
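This equivalence can be illustrated on simulated summary statistics (all variable names and values below are illustrative, not taken from the course data):

```r
# Illustrative sketch: multivariable IVW as a weighted regression.
set.seed(42)
J <- 25                                          # number of variants
x1 <- rnorm(J); x2 <- rnorm(J); x3 <- rnorm(J)   # associations with 3 exposures
chdloddsse <- runif(J, 0.05, 0.10)               # SEs of outcome associations
chdlodds <- 0.4 * x1 + 0.1 * x2 + rnorm(J, sd = chdloddsse)

fit <- lm(chdlodds ~ x1 + x2 + x3 - 1, weights = chdloddsse^-2)
summary(fit)                                     # t-based inference from lm
# Normal-based inference in the IVW spirit: fix the residual SD at 1
est <- coef(fit)
se  <- summary(fit)$coef[, "Std. Error"] / summary(fit)$sigma
2 * pnorm(-abs(est / se))                        # normal p-values
```

The point estimates agree; only the standard errors and p-values differ between the t-based and normal-based versions.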
Do you have a simple program to estimate the genetic risk score?
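A minimal (hedged) sketch of a weighted genetic risk score, using the simulated coursedata genotypes from earlier in the session; the per-SNP weights are taken from the genetic associations with the exposure, as in the ratio method:

```r
# Hedged sketch: weighted genetic risk score (GRS) from individual-level data.
snp_cols <- c("g1", "g2", "g3", "g4")
# Weights: per-SNP associations with the exposure x
weights <- sapply(snp_cols, function(g) {
  coef(lm(reformulate(g, response = "x"), data = coursedata))[2]
})
coursedata$grs <- as.matrix(coursedata[, snp_cols]) %*% weights
# The GRS can then serve as a single instrument, e.g. in a 2SLS analysis
summary(lm(x ~ grs, data = coursedata))
```

With an unweighted score, simply replace the weights by 1 (i.e. count risk alleles).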
Sample size calculation with a continuous or binary outcome: https://shiny.cnsgenomics.com/mRnd/
Another sample size/power calculator for a continuous or binary outcome: http://sb452.shinyapps.io/power/
Fieller’s theorem for confidence intervals with a single instrumental variable: http://sb452.shinyapps.io/fieller/
Likelihood-based method (with heterogeneity test) with summarized data: http://sb452.shinyapps.io/summarized/
Perform fast queries in R against a massive database of complete
GWAS summary data with ieugwasr
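As a hedged sketch (requires internet access; the OpenGWAS study ids below are examples, to be checked on the OpenGWAS website before use, and recent versions of the package may require an API token):

```r
# Query the IEU OpenGWAS database with {ieugwasr}.
library(ieugwasr)
# Genome-wide significant, LD-clumped hits for one study id
top <- tophits(id = "ieu-a-2")                            # example id: GIANT BMI
# Look up those variants in an outcome study
out <- associations(variants = top$rsid, id = "ieu-a-7")  # example id: CHD
```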
LDlink website:
LDlink is a suite of web-based applications designed to easily and
efficiently interrogate linkage disequilibrium in population groups.
Each included application is specialized for querying and displaying
unique aspects of linkage disequilibrium.
MR-Base has a web-app that
allows fitting models without using R. It may help to get fast
results, with example cohorts and traits, and just a few QC steps.
Note:
Choose risk factors, outcomes, …
LD clumping: filter correlated SNPs so the same information is not counted twice
LD proxies: if a SNP cannot be found in the second dataset, use a proxy (r² = 1 means perfectly correlated)
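LD clumping can also be done directly from R; a hedged sketch with ieugwasr::ld_clump(), which expects a data frame with rsid and pval columns (the values below are illustrative):

```r
# Hedged sketch: LD clumping with {ieugwasr} (queries a remote PLINK service).
library(ieugwasr)
dat <- data.frame(rsid = c("rs1558902", "rs2867125", "rs571312"),
                  pval = c(1e-20, 5e-9, 3e-12))
clumped <- ld_clump(dat)   # keeps one SNP per LD-independent region
```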
Mendelian Randomization course - University of Cambridge
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] MendelianRandomization_0.7.0 meta_6.2-1
## [3] ivpack_1.2 AER_1.2-10
## [5] survival_3.2-13 sandwich_3.0-2
## [7] lmtest_0.9-40 zoo_1.8-12
## [9] car_3.1-2 carData_3.0-5
## [11] rmarkdown_2.8 targets_0.4.2
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-155 httr_1.4.7 numDeriv_2016.8-1.1
## [4] tools_4.1.3 bslib_0.2.5 utf8_1.2.1
## [7] R6_2.5.0 metafor_4.2-0 lazyeval_0.2.2
## [10] colorspace_2.0-1 withr_2.5.0 tidyselect_1.2.0
## [13] processx_3.5.2 arrangements_1.1.9 compiler_4.1.3
## [16] glmnet_4.1-8 cli_3.4.1 quantreg_5.95
## [19] SparseM_1.81 xml2_1.3.2 plotly_4.10.1
## [22] labeling_0.4.2 sass_0.4.0 scales_1.2.1
## [25] DEoptimR_1.1-3 robustbase_0.99-0 callr_3.7.0
## [28] stringr_1.5.0 digest_0.6.27 minqa_1.2.4
## [31] pkgconfig_2.0.3 htmltools_0.5.6.1 lme4_1.1-33
## [34] highr_0.9 fastmap_1.1.0 htmlwidgets_1.6.2
## [37] rlang_1.1.1 farver_2.1.0 shape_1.4.6
## [40] jquerylib_0.1.4 generics_0.1.0 jsonlite_1.7.2
## [43] crosstalk_1.2.0 dplyr_1.1.2 magrittr_2.0.3
## [46] Formula_1.2-5 metadat_1.2-0 Matrix_1.5-4.1
## [49] Rcpp_1.0.6 munsell_0.5.0 fansi_0.4.2
## [52] abind_1.4-5 lifecycle_1.0.3 stringi_1.7.12
## [55] yaml_2.2.1 CompQuadForm_1.4.3 mathjaxr_1.6-0
## [58] MASS_7.3-60 grid_4.1.3 lattice_0.21-8
## [61] splines_4.1.3 knitr_1.33 ps_1.6.0
## [64] pillar_1.9.0 igraph_1.2.6 boot_1.3-28
## [67] rjson_0.2.20 codetools_0.2-18 iterpc_0.4.2
## [70] glue_1.6.2 evaluate_0.14 data.table_1.14.0
## [73] renv_0.16.0 BiocManager_1.30.15 vctrs_0.6.2
## [76] nloptr_1.2.2.2 foreach_1.5.1 MatrixModels_0.5-1
## [79] gtable_0.3.0 purrr_1.0.1 tidyr_1.3.0
## [82] ggplot2_3.4.3 xfun_0.23 viridisLite_0.4.0
## [85] tibble_3.2.1 iterators_1.0.13 gmp_0.7-2
## [88] ellipsis_0.3.2